Data Description: The data contains features extracted from the silhouettes of vehicles viewed at different angles. Four "Corgi" model vehicles were used for the experiment: a double-decker bus, a Chevrolet van, a Saab 9000, and an Opel Manta 400. This particular combination of vehicles was chosen with the expectation that the bus, the van, and either one of the cars would be readily distinguishable, but that it would be more difficult to distinguish between the two cars.
Domain: Object recognition
Context: The purpose is to classify a given silhouette as one of three types of vehicle, using a set of features extracted from the silhouette. The vehicle may be viewed from one of many different angles.
Attribute Information:
● All the features are geometric features extracted from the silhouette.
● All are numeric in nature.
Learning Outcomes:
● Exploratory Data Analysis
● Reduce the number of dimensions in the dataset with minimal information loss
● Train a model using Principal Components
Objective: Apply a dimensionality reduction technique (PCA) and train a model using principal components instead of training it on the raw data.
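Before walking through the notebook, here is a minimal, self-contained sketch of the overall workflow this project follows (standardize, reduce with PCA, classify on the components). It runs on synthetic stand-in data, not on vehicle.csv, so the numbers it prints are illustrative only:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the vehicle data: 18 partly redundant numeric features, 3 classes
X, y = make_classification(n_samples=800, n_features=18, n_informative=6,
                           n_redundant=8, n_classes=3, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=7)

# Standardize, keep enough components to explain 95% of the variance, then classify
pipe = make_pipeline(StandardScaler(), PCA(n_components=0.95), GaussianNB())
pipe.fit(X_train, y_train)
print("components kept:", pipe.named_steps['pca'].n_components_)
print("test accuracy:", pipe.score(X_test, y_test))
```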
#importing some necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
dataset = pd.read_csv('vehicle.csv')
dataset.head()
dataset.shape
dataset.describe().transpose()
dataset.dtypes
# Finding the number of records for each unique value of the target variable; note the target variable is 'class'
dataset['class'].value_counts()
#Boxplot to understand spread and outliers
dataset.plot(kind='box', figsize=(20,10))
dataset.hist(figsize=(15,15))
# Checking for null values in all the attributes
dataset.isnull().sum()
# Replace missing values with the column median, since there are many outliers
# (otherwise we would replace them with the mean)
for i in dataset.columns[:-1]:  # all feature columns; the last column is 'class'
    median_value = dataset[i].median()
    dataset[i] = dataset[i].fillna(median_value)
dataset.info()
# checking for null values again to ensure there are no null values present
dataset.isnull().sum()
# Identifying outliers using the IQR rule and replacing them with the column median
for col_name in dataset.columns[:-1]:
    q1 = dataset[col_name].quantile(0.25)
    q3 = dataset[col_name].quantile(0.75)
    iqr = q3 - q1
    low = q1 - 1.5 * iqr
    high = q3 + 1.5 * iqr
    dataset.loc[(dataset[col_name] < low) | (dataset[col_name] > high), col_name] = dataset[col_name].median()
dataset.describe().transpose()
# checking whether outliers fixed or not
dataset.plot(kind='box', figsize=(20,10))
hdf=dataset.copy()
for feature in hdf.columns:              # loop through all columns in the dataframe
    if hdf[feature].dtype == 'object':   # only apply to columns with categorical strings
        hdf[feature] = pd.Categorical(hdf[feature]).codes  # replace strings with integer codes
#importing seaborn for statistical plots
import seaborn as sns
sns.pairplot(hdf, height=7, aspect=0.5, diag_kind='kde')  # 'size' was renamed to 'height' in seaborn 0.9
# From the diagonal panels we can estimate the number of clusters by observing the number of peaks in each density estimate, and from the off-diagonal panels we can choose a distance metric. In this case we choose 3 clusters.
from sklearn.cluster import AgglomerativeClustering
model = AgglomerativeClustering(n_clusters=3, metric='manhattan', linkage='average')  # scikit-learn versions before 1.2 call this parameter 'affinity'
model.fit(hdf)
hdf['labels'] = model.labels_
hdf.groupby(["labels"]).count()
hdf_clusters = hdf.groupby(['labels'])
print(hdf_clusters.mean())  # cluster profiles; printing the GroupBy object itself only shows its repr
from scipy.cluster.hierarchy import cophenet, dendrogram, linkage
from scipy.spatial.distance import pdist #Pairwise distribution between data points
# cophenet index is a measure of the correlation between the distance of points in feature space and distance on dendrogram
# closer it is to 1, the better is the clustering
# using average linkage
Z = linkage(hdf, 'average')
c, coph_dists = cophenet(Z , pdist(hdf))
c
plt.figure(figsize=(10, 10))
plt.title('Agglomerative Hierarchical Clustering Dendrogram')
plt.xlabel('sample index')
plt.ylabel('Distance')
dendrogram(Z, leaf_rotation=90.,color_threshold = 40, leaf_font_size=8. )
plt.tight_layout()
# Using complete linkage
Z = linkage(hdf, 'complete')
c, coph_dists = cophenet(Z , pdist(hdf))
c
plt.figure(figsize=(15, 15))
plt.title('Agglomerative Hierarchical Clustering Dendrogram')
plt.xlabel('sample index')
plt.ylabel('Distance')
dendrogram(Z, leaf_rotation=90.,color_threshold=90, leaf_font_size=10. )
plt.tight_layout()
# linkage as ward
Z = linkage(hdf, 'ward')
c, coph_dists = cophenet(Z , pdist(hdf))
c
plt.figure(figsize=(15, 15))
plt.title('Agglomerative Hierarchical Clustering Dendrogram')
plt.xlabel('sample index')
plt.ylabel('Distance')
dendrogram(Z, leaf_rotation=90.,color_threshold=600, leaf_font_size=10. )
plt.tight_layout()
model = AgglomerativeClustering(n_clusters=3, metric='euclidean')  # scikit-learn versions before 1.2 call this parameter 'affinity'
model.fit(hdf)
hdf['labels'] = model.labels_
hdf.groupby(["labels"]).count()
hdf_clusters = hdf.groupby(['labels'])
print(hdf_clusters.mean())  # cluster profiles; printing the GroupBy object itself only shows its repr
Z = linkage(hdf, 'average')
c, coph_dists = cophenet(Z , pdist(hdf))
c
plt.figure(figsize=(10, 10))
plt.title('Agglomerative Hierarchical Clustering Dendrogram')
plt.xlabel('sample index')
plt.ylabel('Distance')
dendrogram(Z, leaf_rotation=90.,color_threshold = 40, leaf_font_size=8. )
plt.tight_layout()
Z = linkage(hdf, 'complete')
c, coph_dists = cophenet(Z , pdist(hdf))
c
plt.figure(figsize=(10, 10))
plt.title('Agglomerative Hierarchical Clustering Dendrogram')
plt.xlabel('sample index')
plt.ylabel('Distance')
dendrogram(Z, leaf_rotation=90.,color_threshold = 40, leaf_font_size=8. )
plt.tight_layout()
Z = linkage(hdf, 'ward')
c, coph_dists = cophenet(Z , pdist(hdf))
c
plt.figure(figsize=(10, 10))
plt.title('Agglomerative Hierarchical Clustering Dendrogram')
plt.xlabel('sample index')
plt.ylabel('Distance')
dendrogram(Z, leaf_rotation=90.,color_threshold = 40, leaf_font_size=8. )
plt.tight_layout()
# Importing necessary libraries
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from scipy.stats import zscore
pcdf=dataset.copy()
for feature in pcdf.columns:             # loop through all columns in the dataframe
    if pcdf[feature].dtype == 'object':  # only apply to columns with categorical strings
        pcdf[feature] = pd.Categorical(pcdf[feature]).codes  # replace strings with integer codes
pcdf.head()
# Note bus is replaced with 0 , car with 1 and van with 2 in class column
# separating the target column and the other independent columns
X = pcdf.iloc[:,0:18]
y = pcdf.iloc[:,18]
sns.pairplot(pcdf, diag_kind='kde')
# Note: from the pair plot we see that the independent columns are strongly correlated with one another, so a lot of redundant information would be fed to the model. To reduce this redundancy with minimal information loss we use PCA.
# We standardize the entire X (independent variable data) to z-scores. We will create the PCA dimensions
# on this standardized distribution.
sc = StandardScaler()
X_std = sc.fit_transform(X)
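As a quick sanity check (on toy data, separate from the analysis), StandardScaler produces the same result as scipy's `zscore`, which was imported above; both divide by the population standard deviation:

```python
import numpy as np
from scipy.stats import zscore
from sklearn.preprocessing import StandardScaler

toy = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
scaled = StandardScaler().fit_transform(toy)     # (x - mean) / population std, per column
print(np.allclose(scaled, zscore(toy, axis=0)))  # True: identical standardization
```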
cov_matrix = np.cov(X_std.T)
print('Covariance Matrix \n%s' % cov_matrix)
eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)
print('Eigen Vectors \n%s' % eigenvectors)
print('\n Eigen Values \n%s' % eigenvalues)
# Step 3 (continued): Sort eigenvalues in descending order
# Make a set of (eigenvalue, eigenvector) pairs
eig_pairs = [(eigenvalues[index], eigenvectors[:,index]) for index in range(len(eigenvalues))]
# Sort the (eigenvalue, eigenvector) pairs from highest to lowest eigenvalue
# (sorting by key avoids comparing the eigenvector arrays when eigenvalues tie)
eig_pairs.sort(key=lambda pair: pair[0], reverse=True)
print(eig_pairs)
# Extract the descending ordered eigenvalues and eigenvectors
eigvalues_sorted = [eig_pairs[index][0] for index in range(len(eigenvalues))]
eigvectors_sorted = [eig_pairs[index][1] for index in range(len(eigenvalues))]
# Let's confirm our sorting worked, print out eigenvalues
print('Eigenvalues in descending order: \n%s' %eigvalues_sorted)
tot = sum(eigenvalues)
var_explained = [(i / tot) for i in sorted(eigenvalues, reverse=True)] # an array of variance explained by each
# eigenvector... there will be 18 entries as there are 18 eigenvectors
cum_var_exp = np.cumsum(var_explained) # an array of cumulative variance; there will be 18 entries, with the 18th
# reaching almost 100%
cum_var_exp.size
plt.bar(range(1,19), var_explained, alpha=0.5, align='center', label='individual explained variance')
plt.step(range(1,19),cum_var_exp, where= 'mid', label='cumulative explained variance')
plt.ylabel('Explained variance ratio')
plt.xlabel('Principal components')
plt.legend(loc = 'best')
plt.show()
# From the graph we find that most of the information is contained in the first 6 principal components
# P_reduce represents reduced mathematical space....
P_reduce = np.array(eigvectors_sorted[0:6]) # Reducing from 18 to 6 dimension space
X_std_6D = np.dot(X_std,P_reduce.T) # projecting original data into principal component dimensions
Proj_data_df = pd.DataFrame(X_std_6D) # converting array to dataframe for pairplot
#Let us check it visually
sns.pairplot(Proj_data_df, diag_kind='kde')
# As expected, the spread of the off-diagonal data points in the reduced space is now almost spherical: the principal components are uncorrelated
from sklearn import model_selection
test_size = 0.30 # taking 70:30 training and test set
seed = 7 # random number seed for repeatability of the code
X_train, X_test, y_train, y_test = model_selection.train_test_split(Proj_data_df, y, test_size=test_size, random_state=seed)
## We will use the Naive Bayes & Support Vector Classifiers
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
#Grid search to tune model parameters for SVC
from sklearn.model_selection import GridSearchCV
model = SVC()
params = {'C': [0.01, 0.1, 0.5, 1], 'kernel': ['linear', 'rbf']}
model1 = GridSearchCV(model, param_grid=params, verbose=5)
model1.fit(X_train, y_train)
print("Best Hyper Parameters:\n", model1.best_params_)
# Best Hyper Parameters:
# {'C': 1, 'kernel': 'rbf'}
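Rather than re-typing the best parameters into a fresh SVC, the refit estimator can be taken straight from the grid search. A small self-contained sketch on synthetic data (assuming nothing about vehicle.csv):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_classes=3, n_informative=5, random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=7)

grid = GridSearchCV(SVC(), {'C': [0.01, 0.1, 0.5, 1], 'kernel': ['linear', 'rbf']})
grid.fit(X_tr, y_tr)
# GridSearchCV refits the best model on the full training set (refit=True by default),
# so grid.best_estimator_ can be used directly instead of rebuilding an SVC by hand.
print(grid.best_params_, grid.best_estimator_.score(X_te, y_te))
```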
# calculate accuracy measures and confusion matrix
from sklearn import metrics
#Build the model with the best hyper parameters
svc_model = SVC(C=1, kernel="rbf")
# Fitting the model
svc_model.fit(X_train, y_train)
#Prediction on test set
prediction = svc_model.predict(X_test)
# Accuracy on test set
accuracy = svc_model.score(X_test, y_test)
expected=y_test
print("Classification report")
print(metrics.classification_report(expected, prediction))
print("Confusion matrix")
print(metrics.confusion_matrix(expected, prediction))
print("Overall score ",accuracy)
# We get an overall score of 93.7% with a good recall value
model = GaussianNB()
iterationList=np.random.randint(1,100,10)
itr = 1
for i in iterationList:
    seed = i
    X_train, X_test, y_train, y_test = model_selection.train_test_split(Proj_data_df, y, test_size=test_size, random_state=seed)
    # Fitting the model
    model.fit(X_train, y_train)
    # Prediction on test set
    prediction = model.predict(X_test)
    # Accuracy on test set
    accuracy = model.score(X_test, y_test)
    expected = y_test
    print("Iteration ", itr)
    itr = itr + 1
    print()
    print("data split random state ", seed)
    print("Classification report")
    print(metrics.classification_report(expected, prediction))
    print("Confusion matrix")
    print(metrics.confusion_matrix(expected, prediction))
    print("Overall score ", accuracy)
    print("----------------------------------------------------")
# For Naive Bayes we get a highest overall score of 79% for data split random state = 36
P_reduce = np.array(eigvectors_sorted[0:9]) # increasing from 6 to 9 dimension space
X_std_9D = np.dot(X_std,P_reduce.T) # projecting original data into principal component dimensions
Proj_data_df = pd.DataFrame(X_std_9D) # converting array to dataframe for pairplot
sns.pairplot(Proj_data_df, diag_kind='kde')
from sklearn import model_selection
test_size = 0.30 # taking 70:30 training and test set
seed = 7 # random number seed for repeatability of the code
X_train, X_test, y_train, y_test = model_selection.train_test_split(Proj_data_df, y, test_size=test_size, random_state=seed)
#Grid search to tune model parameters for SVC
from sklearn.model_selection import GridSearchCV
model = SVC()
params = {'C': [0.01, 0.1, 0.5, 1], 'kernel': ['linear', 'rbf']}
model1 = GridSearchCV(model, param_grid=params, verbose=5)
model1.fit(X_train, y_train)
print("Best Hyper Parameters:\n", model1.best_params_)
# Best Hyper Parameters:
# {'C': 0.5, 'kernel': 'rbf'}
#Build the model with the best hyper parameters
svc_model = SVC(C=0.5, kernel="rbf")
# Fitting the model
svc_model.fit(X_train, y_train)
#Prediction on test set
prediction = svc_model.predict(X_test)
# Accuracy on test set
accuracy = svc_model.score(X_test, y_test)
expected=y_test
print("Classification report")
print(metrics.classification_report(expected, prediction))
print("Confusion matrix")
print(metrics.confusion_matrix(expected, prediction))
print("Overall score ",accuracy)
# The overall score increased to 97.2% when dimensions are increased from 6 to 9
model = GaussianNB()
iterationList=np.random.randint(1,100,10)
itr = 1
for i in iterationList:
    seed = i
    X_train, X_test, y_train, y_test = model_selection.train_test_split(Proj_data_df, y, test_size=test_size, random_state=seed)
    # Fitting the model
    model.fit(X_train, y_train)
    # Prediction on test set
    prediction = model.predict(X_test)
    # Accuracy on test set
    accuracy = model.score(X_test, y_test)
    expected = y_test
    print("Iteration ", itr)
    itr = itr + 1
    print()
    print("data split random state ", seed)
    print("Classification report")
    print(metrics.classification_report(expected, prediction))
    print("Confusion matrix")
    print(metrics.confusion_matrix(expected, prediction))
    print("Overall score ", accuracy)
    print("----------------------------------------------------")
# Now the highest Naive Bayes score is 85%, obtained for random state 11
P_reduce = np.array(eigvectors_sorted) # taking all the dimensions
X_std_all = np.dot(X_std,P_reduce.T) # projecting original data into principal component dimensions
Proj_data_df = pd.DataFrame(X_std_all) # converting array to dataframe for pairplot
sns.pairplot(Proj_data_df, diag_kind='kde')
from sklearn import model_selection
test_size = 0.30 # taking 70:30 training and test set
seed = 7 # random number seed for repeatability of the code
X_train, X_test, y_train, y_test = model_selection.train_test_split(Proj_data_df, y, test_size=test_size, random_state=seed)
#Grid search to tune model parameters for SVC
from sklearn.model_selection import GridSearchCV
model = SVC()
params = {'C': [0.01, 0.1, 0.5, 1], 'kernel': ['linear', 'rbf']}
model1 = GridSearchCV(model, param_grid=params, verbose=5)
model1.fit(X_train, y_train)
print("Best Hyper Parameters:\n", model1.best_params_)
#Build the model with the best hyper parameters
svc_model = SVC(C=1, kernel="rbf")
# Fitting the model
svc_model.fit(X_train, y_train)
#Prediction on test set
prediction = svc_model.predict(X_test)
# Accuracy on test set
accuracy = svc_model.score(X_test, y_test)
expected=y_test
print("Classification report")
print(metrics.classification_report(expected, prediction))
print("Confusion matrix")
print(metrics.confusion_matrix(expected, prediction))
print("Overall score ",accuracy)
# The overall score is 97.6%. This is almost the same as the score obtained with only 9 dimensions, which shows that most of the information is captured by the first 9 principal components chosen earlier
model = GaussianNB()
iterationList=np.random.randint(1,100,10)
itr = 1
for i in iterationList:
    seed = i
    X_train, X_test, y_train, y_test = model_selection.train_test_split(Proj_data_df, y, test_size=test_size, random_state=seed)
    # Fitting the model
    model.fit(X_train, y_train)
    # Prediction on test set
    prediction = model.predict(X_test)
    # Accuracy on test set
    accuracy = model.score(X_test, y_test)
    expected = y_test
    print("Iteration ", itr)
    itr = itr + 1
    print()
    print("data split random state ", seed)
    print("Classification report")
    print(metrics.classification_report(expected, prediction))
    print("Confusion matrix")
    print(metrics.confusion_matrix(expected, prediction))
    print("Overall score ", accuracy)
    print("----------------------------------------------------")
# The highest score obtained from Naive Bayes this time is 87% for random state = 29
# SVC
test_size = 0.30 # taking 70:30 training and test set
seed = 7 # random number seed for repeatability of the code
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=test_size, random_state=seed)
#Grid search to tune model parameters for SVC
from sklearn.model_selection import GridSearchCV
model = SVC()
params = {'C': [0.01, 0.1, 0.5, 1], 'kernel': ['linear', 'rbf']}
model1 = GridSearchCV(model, param_grid=params, verbose=5)
model1.fit(X_train, y_train)
print("Best Hyper Parameters:\n", model1.best_params_)
#Build the model with the best hyper parameters
svc_model = SVC(C=0.01, kernel="linear")
# Fitting the model
svc_model.fit(X_train, y_train)
#Prediction on test set
prediction = svc_model.predict(X_test)
# Accuracy on test set
accuracy = svc_model.score(X_test, y_test)
expected=y_test
print("Classification report")
print(metrics.classification_report(expected, prediction))
print("Confusion matrix")
print(metrics.confusion_matrix(expected, prediction))
print("Overall score ",accuracy)
# Taking all the independent variables directly (without PCA), SVC gives an overall score of 93%, which is lower than the score we get when we use PCA.
model = GaussianNB()
iterationList=np.random.randint(1,100,10)
itr = 1
for i in iterationList:
    seed = i
    X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=test_size, random_state=seed)
    # Fitting the model
    model.fit(X_train, y_train)
    # Prediction on test set
    prediction = model.predict(X_test)
    # Accuracy on test set
    accuracy = model.score(X_test, y_test)
    expected = y_test
    print("Iteration ", itr)
    itr = itr + 1
    print()
    print("data split random state ", seed)
    print("Classification report")
    print(metrics.classification_report(expected, prediction))
    print("Confusion matrix")
    print(metrics.confusion_matrix(expected, prediction))
    print("Overall score ", accuracy)
    print("----------------------------------------------------")
# The highest score for Naive Bayes without PCA is 67%, which is much lower than the results we get when we use PCA.
From the Principal Component Analysis we find that the optimal number of dimensions for building a model is 9, which matches what we see in the summary plot of eigenvalues sorted in descending order. From SVC and Naive Bayes we also find a considerable increase in model performance when the dimensions are increased from 6 (93%) to 9 (97%); however, the further gain is negligible when all 18 dimensions are used. We also noticed that model performance drops significantly when we don't use PCA. Hence, from this project we infer that when there is strong correlation among the independent variables, PCA is a good choice: it captures the covariance structure and helps us choose an optimal number of dimensions, resulting in an increase in model performance.
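As a closing note, the manual eigendecomposition used in this notebook can be reproduced with scikit-learn's PCA class. A short sketch on synthetic data (independent of vehicle.csv) showing that the explained-variance ratios from both routes agree:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6)) @ rng.normal(size=(6, 6))  # correlated features
X_std = StandardScaler().fit_transform(X)

# Manual route: eigendecomposition of the covariance matrix
eigvals = np.linalg.eigvalsh(np.cov(X_std.T))[::-1]      # sorted descending
manual_ratio = eigvals / eigvals.sum()

# scikit-learn route: PCA computes the same ratios internally
sk_ratio = PCA().fit(X_std).explained_variance_ratio_
print(np.allclose(manual_ratio, sk_ratio))               # True
```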